Programmers Don’t Understand Hash Functions

Programmers don’t understand hash functions, and I can demonstrate this to most of the people that will read this with a single observation:

When you saw the words “hash function” in the title, you might have assumed this was going to be a blog post about password storage. (Passwords are the most common knee-jerk reaction I get to any discussion about hash functions, anyway. A little bit of security knowledge can be very dangerous.)

I don’t blame software developers for their lack of understanding on anything I’m going to discuss. The term “hash function” can accurately represent any of the following disparate topics in computer science:

Converting keys into memory addresses for hash maps
Creating a one-way message digest to ensure said was transmitted correctly over an untrusted network
Cryptographic building blocks for digital signatures, message authentication codes, key derivation functions, and so on
Server-side password storage techniques meant to resist brute force and precomputation attacks
Perceptual fingerprints of visual data

Some hash functions are reusable across multiple topics from the above list, but even then, not all of the hash functions you’d use for one purpose have the same properties.

Using a hash function for the wrong purpose, or in a place where it doesn’t provide the expected properties, can lead to security vulnerabilities. Some of these vulnerabilities aren’t obvious or straightforward, either, which only serves to magnify confusion.

So let’s talk about hash functions, what programmers get wrong about them, and the correct answers to these common misconceptions.

What Are Hash Functions?

A hash function is any function that can map an arbitrary-sized input to a fixed-size output.

This is incredibly vague; anything can be a hash function if you’re brave enough. You can treat an arbitrary string as an big integer and return the least significant bit and call this a hash function. (It’s not a particularly good one, but it fits the definition.)

function totallyHashed(input: string|Buffer): Buffer {
    const inBuf = Buffer.isBuffer(input) 
        ? input
        : Buffer.from(input);
    return Buffer.from([
        inBuf[inBuf.length - 1] & 1
    ]);
}

Being able to call something a hash function and be technically correct isn’t very helpful.

What Kinds of Hash Functions Do We Care About?

There are different types of hash functions suitable for solving different types of problems. (Some examples follow, but this should not be taken as an exhaustive treatment.)

General-Purpose Hash Functions

General-purpose hash functions are useful for converting a key into an index for memory addresses when constructing hash tables. Usually when someone says only “hash function” in a broad computer science context, without any qualifiers, that’s what they’re talking about.

Examples: SipHash, djb2, Murmur3. You can find a comparison here.

Worth calling out: Only some of these general purpose tables are safe to use for hash tables where the keys are provided from user input. For example: JSON data. When in doubt, SipHash-2-4 is a good default.

Cryptographic Hash Functions

Cryptographic hash functions have additional desirable properties (they’re non-invertible and must be resistant to collision attacks and preimage attacks) above general-purpose hash functions. They also have larger output sizes (typically at least 256 bits) than the sort of hash functions you’d use for hash tables. Consequently, they’re slower than the hash functions people tend to use for hash tables.

Cryptographic hash functions are often used in place of a random oracle in security protocols, because actual random oracles do not exist, as Thomas Pornin explains:

A random oracle is described by the following model:

There is a black box. In the box lives a gnome, with a big book and some dice.

We can input some data into the box (an arbitrary sequence of bits).

Given some input that he did not see beforehand, the gnome uses his dice to generate a new output, uniformly and randomly, in some conventional space (the space of oracle outputs). The gnome also writes down the input and the newly generated output in his book.

If given an already seen input, the gnome uses his book to recover the output he returned the last time, and returns it again.

So a random oracle is like a kind of hash function, such that we know nothing about the output we could get for a given input message m. This is a useful tool for security proofs because they allow to express the attack effort in terms of number of invocations to the oracle.
The problem with random oracles is that it turns out to be very difficult to build a really “random” oracle. First, there is no proof that a random oracle can really exist without using a gnome. Then, we can look at what we have as candidates: hash functions. A secure hash function is meant to be resilient to collisions, preimages and second preimages. These properties do not imply that the function is a random oracle.
Thomas Pornin

I’m intentionally eschewing a lot of detail about what makes a cryptographic hash function secure (e.g. bit diffusion even in reduced rounds), or how they achieve the desirable properties.

If you’re interested in those topics, leave a comment below and I’ll talk about that in a future post.

If you remember nothing else about cryptographic hash functions, just know that checksums (e.g. CRC32) are not cryptographic. (Yes, many hashing APIs include CRC32 alongside good options, but don’t be misled.)

Note: There’s a very pedantic mathematical discussion that can be had about whether or not cryptographic hash functions are truly one-way functions (which, like P = NP vs P != NP, is an open conjecture in mathematics). You don’t have to know, or even care, about this distinction–unless you’re making assumptions about this property in your designs.

Examples: BLAKE3, SHA256.

A Word Of Caution on Message Authentication Codes

In cryptography, message authentication codes are often built with cryptographic hash functions, but not always!

AES-GCM famously uses a function called GHASH, which evaluates a polynomial in ${GF}(2^{128})$ , rather than a cryptographic hash function. This provides a great speed boost, but fails to provide collision resistance, leading to interesting results.

Poly1305 is a similar story to GCM, but doesn’t reuse keys the same way.

Although GHASH and Poly1305 are secure MACs, they’re not built from cryptographic hash functions.

CBC-MAC uses a block cipher (typically AES) in cipher-block chaining mode (with an all-zero initialization vector) and outputs the last block as the authentication tag. This offers no collision resistance (as Mega learned the hard way).

When in doubt, HMAC is a safe choice.

Password Hashing Functions

Separate from, but often built from, cryptographic hash functions are password hashing functions. These are the tools you use if your users are sending a username and password over HTTP and you don’t want to become the next RockYou.

Examples: Argon2, scrypt, bcrypt.

Password hashing functions have significant overlap with key-derivation functions (KDFs), but not all KDFs are meant for password hashing, and not all password hashes are meant for key derivation.

It’s perfectly reasonable to use bcrypt to store passwords, but not to derive encryption keys.

Conversely, HKDF is a valid KDF (which has a stricter set of security requirements than a PRF), but it’s not meant for passwords.

Some algorithms (Argon2 and scrypt) can be safely used as a password hashing function or as a KDF.

Modern password hashing algorithms involve a deliberately expensive computation that’s fast to verify once, but expensive to verify multiple times. These algorithms can be tuned (memory usage, parallel threads, number of iterations) to target a specific latency goal. Additionally, most password hashing APIs take care of salt generation for you, using a secure random generator.

Perceptual Hashes

In yet another different direction, you have perceptual hash functions, such as the kind Apple is going to use to violate the privacy of their customers in a desperate attempt to catch a small percentage of depraved individuals peddling CSAM (and expecting the rest of their customers to blindly trust that this capability they built for themselves will never ever be used to stifle dissidents or whistleblowers).

This dog, just can’t stop colliding. pic.twitter.com/bUyjkK2HR7
— Matthew Green (@matthew_d_green) August 19, 2021

Source here

I don’t have a lot to say here, except that I don’t trust Apple, especially after the internal memo about “screeching voices of the minority”: Their partners on this project have showed the utmost contempt for privacy activists, LGBTQ+ folks, survivors of child sexual abuse, etc. and they remain committed to them. Fuck Apple.

Suffice to say, cryptographers were not at all surprised by the discovery of practical collisions against Apple’s new perceptual hash function, because perceptual hash functions do not provide the same properties as cryptographic hash functions.

Perceptual hashes of CSAM do not provide collision or preimage resistance, and it would be possible to flood Apple with false positives if a hash of such material were to ever leak publicly. (Maybe an enterprising Internet Troll will one day make a meme generator that does this?)

How Developers Misuse Hash Functions

There are far too many ways that hash functions get misused by software developers to recount in one blog post.

Some of the more common and obvious examples (i.e., using MD5 to store passwords, calling SHA256 “encryption”) aren’t very interesting, and are covered elsewhere, so I’m going to focus on misuse examples that aren’t commonly discussed online.

Encrypt and Hash

A common design pattern from the 2010’s is to hash some data, then encrypt the data that was hashed (but not the hash), and then send both values to another machine.

The expectation here is that, upon decrypting the ciphertext, the hash can be used as a client-side fingerprint to ensure the data wasn’t corrupted in storage.

This use of a hash function is distinct from the Encrypt/MAC discussion (see: the Cryptographic Doom Principle), because it’s often implemented alongside AEAD. (If you aren’t using authenticated encryption, correct that first.)

However, there are two problems here:

Directly invoking cryptographic hash functions doesn’t involve a cryptographic secret, and thus all clients will produce the same hash of the same plaintext.
A cryptographic hash can be used to perform offline attacks (e.g. rainbow tables) against the plaintext, especially if the input domain is small (i.e. a phone number).

What you really want to use in this situation is HMAC with a static secret key (which is only known client-side).

HMAC ensures that, without access to the secret key, precomputation attacks are not possible. Additionally, if each client has a different secret key (which, they SHOULD), an attacker who only has access to hashes and ciphertext cannot distinguish which records correspond across multiple clients.

This is a weaker proposition than security against chosen plaintexts (IND-CPA), but still provides a higher level of assurance than a naked SHA256 hash.

Searchable Encryption with Hashed Indexes

Similar to the previous example, sometimes developers will be tasked with storing encrypted values in a database while also being able to quickly query the database based on an encrypted value.

The laziest solution to the “encrypted database” use-case is to just use deterministic encryption, such as AES-ECB or with a static IV/nonce. Most people who have some familiarity with cryptography immediately recognize that this is dangerous, so they opt to encrypt securely (i.e. AEAD with random nonces).

However, to support querying, they often just hash their plaintext and store it alongside the ciphertext.

This reintroduces the issues from the previous section (especially rainbow tables), but with additional risks:

Plaintexts are overwhelmingly likely to have smaller input domains, thereby increasing the utility of hashes to attack the confidentiality of the plaintext.
No domain separation between different hashes of different encrypted fields.

To address this, there are a few things you can do:

Truncate your hashes. If you want to frustrate attackers, simply don’t store a full hash of the plaintext. Instead, truncate hashes to 8 or fewer hexits and permit a small degree of false positives in your decryption logic (n.b. by filtering those rows out).
Use distinct HMAC keys per index. This introduces the solution to the previous section, but also addresses domain separation.

If you’re left wondering, “Can I use both solutions?” The answer is, “Yes, but you just reinvented what CipherSweet calls blind indexes.”

Overconfidence With Collision-Resistance

Previously on this blog, I disclosed a trivial collision vulnerability in the Kerl hash function used by the Iota cryptocurrency project without breaking the underling hash function (Keccak384).

How did I do this? I found multiple input values that, before being passed to the hash function, collide with each other.

Developers are frequently overconfident about the collision-resistance of their protocol, simply because they use collision-resistant hash functions inside of the protocol. They’re frequently blind to the reality of canonicalization attacks (including the somewhat famous length-extension attack).

This isn’t their fault. If you’ve made this mistake, it isn’t your fault. Cryptography is a difficult topic and requires a lot of experience and expertise to get right.

Closing Thoughts

One of my favorite cryptography websites is for the SPHINCS project, which is a stateless post-quantum hash-based digital signature algorithm.

On this webpage, which has a heavy anarchist aesthetic, there’s a special note to law-enforcement agents that reads:

The word “state” is a technical term in cryptography. Typical hash-based signature schemes need to record information, called “state”, after every signature. Google’s Adam Langley refers to this as a “huge foot-cannon” from a security perspective. By saying “eliminate the state” we are advocating a security improvement, namely adopting signature schemes that do not need to record information after every signature. We are not talking about eliminating other types of states. We love most states, especially yours! Also, “hash” is another technical term and has nothing to do with cannabis.
SPHINCS: Introduction

Now, I personally like this disclaimer because a) it’s funny and b) it reminds us that all cops are bastards.

But it’s also a good reminder of how confusing the term “hash” can be in different fields. Software engineers aren’t the only people who are likely to be confused about hash functions.

(And I can’t even apply the “-ish” suffix to talk about things that behave like hash functions but aren’t hash functions, because that’s a homograph for an even more specific drug term.)

The next time you see a programmer ask a potentially unwise question involving hash functions, having read this blog post, I hope you’ll appreciate how confusing all this shit can be for virtually everyone.

If you’ve made any of the specific mistakes I’ve discussed here, know that you’re in very good company. Some of the best programmers in the industry have made these mistakes before. Hell, I’ve made these exact mistakes before, and worse.

By Soatok

Security engineer with a fursona. Ask me about dholes or Diffie-Hellman!

View Archive